Extraction of Layout Entities and Sub-layout Query-based Retrieval of Document Images

Authors

  • Anukriti Bansal
  • Sumantra Dutta Roy
  • Gaurav Harit
Abstract

Layouts and sub-layouts constitute an important clue while searching a document on the basis of its structure, or when textual content is unknown/irrelevant. A sub-layout specifies the arrangement of document entities within a smaller portion of the document. We propose an efficient graph-based matching algorithm, integrated with hash-based indexing, to prune a possibly large search space. A user can specify a combination of sub-layouts of interest using sketch-based queries. The system supports partial matching for unspecified layout entities. We handle segmentation pre-processing errors (for text/non-text blocks) with a symmetry maximization-based strategy, and by accounting for multiple domain-specific plausible segmentation hypotheses. We show promising results of our system on a database of unstructured entities, containing 4776 newspaper images.
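The idea of using a hash-based index to prune the search space before graph matching can be illustrated with a minimal sketch. This is not the authors' algorithm: the edge representation `(label, relation, label)`, the function names `build_index` and `candidate_pages`, and the toy pages are all assumptions made for illustration. The sketch only shows the pruning step; a real system would still run graph matching on the surviving candidates.

```python
from collections import defaultdict

# Assumption: each page layout is reduced to a set of labelled edges
# (entity_label, spatial_relation, entity_label). This encoding is hypothetical.

def build_index(pages):
    """Map each labelled edge to the set of page ids whose layout contains it."""
    index = defaultdict(set)
    for page_id, edges in pages.items():
        for edge in edges:
            index[edge].add(page_id)
    return index

def candidate_pages(index, query_edges):
    """Return the pages that contain every query edge (a pruning filter only)."""
    sets = [index.get(edge, set()) for edge in query_edges]
    if not sets:
        return set()  # an empty query selects nothing in this sketch
    return set.intersection(*sets)

# Toy database of three hypothetical page layouts.
pages = {
    "p1": {("headline", "above", "text"), ("text", "left_of", "image")},
    "p2": {("headline", "above", "image"), ("image", "above", "text")},
    "p3": {("headline", "above", "text"), ("text", "above", "image")},
}
index = build_index(pages)

# Sub-layout query: a headline above text, with the text above an image.
query = [("headline", "above", "text"), ("text", "above", "image")]
print(sorted(candidate_pages(index, query)))
```

Because the index stores only individual edges, it can return pages where the edges exist but do not form the queried sub-layout as a connected whole, which is why a subsequent graph-matching pass on the candidates is still needed.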


Similar resources

Layout Based Information Retrieval from Document Images

This research is intended to develop a layout-based retrieval system for document image databases, consisting of three phases: 1. First, an intelligent layout analysis algorithm has been designed to physically extract the layouts of the document images with their edges and rectangles. 2. Every physically identified layout has been converted into a tree intermediary representation for indexing and s...


Model-Guided Segmentation and Layout Labelling of Document Images Using a Hierarchical Conditional Random Field

We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs, hereafter). Common methods to classify a pixel of a document image into classes text, background and image are often noisy, and error-prone, often requiring post-processing through heuristic methods. The input to the system is a pixel-wise classification based on t...


Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting has been proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...


Document Image Retrieval Based on Layout Structural Similarity

In this paper, we describe issues related to the measurement of structural similarity between document images. We define structural similarity, and discuss the benefits of using it as a complement to content similarity for querying document image databases. We present an approach to computing a geometrically invariant structural similarity, and use this measure to search document image database...


Unconstrained Tight Structure Extraction Using Voronoi Tesselation on Document Images

Document structure is the intermediary result obtained through page segmentation, which is used in the analysis of the document image. The structure serves the purpose of extracting the shape of the document from paragraph up to character level in a hierarchical exploratory methodology for understanding the layout structure of the document image. The extracted layout forms a dominant feature wh...



Journal title:
  • CoRR

Volume abs/1609.02687  Issue

Pages  -

Publication date 2016